Skip to content

Conversation

@michaelnebel
Copy link
Contributor

@michaelnebel michaelnebel commented Nov 29, 2024

In this PR we change how much information we extract from reference assemblies.
Prior to this PR all symbols (also private an internal) were extracted from the reference assemblies even though not all of them are accessible outside the assemblies (e.g. it is not possible to call a private method outside the assembly).

With this PR, we only extract public and protected members ("effectively" public members and types) from the reference assemblies.

There is one identified drawback with this approach, which was identified by DCA:

  • If a project exposes its internals to another project and if the project that exposes its internals is referenced as a dll (and not by source), then we get missing call targets (and thereby unknown expression types). This happens in a few cases in our DCA suite (two projects from traced extraction were investigated)
    • Roslyn makes use of some dll's which were originally implemented in VB (this was mostly in tests).
    • Powershell makes use of one project, which is referenced by dll and which gives us eight missing call targets. The eight extra missing call targets are also due InternalsVisible (in this case internals from Microsoft.Management.Infrastructure are made accessible to System.Management.Automation, which is declared in the powershell project).

Making solutions in production code using InternalsVisible in an anti-pattern (note that we only get an issue in case internals are in reference assemblies). Furthermore, the "increase" in the number of missing call targets and expressions with unknown types for the buildless extractions are negligible.

The advantages are

  • DCA identifies a 10% overall speedup of analysis time.
  • DCA identified an approximately DB reduction of around 10%.
  • For .NET 9 we avoid DB inconsistencies (failing DB check) when including some of the standard Microsoft reference DLLs that have conflicting internal declarations (this will otherwise have to be fixed in another way) <-- This was the original motivation to look into this.

DCA doesn't show any alert changes to the security related queries.
IMO the advantages outweigh the drawbacks.

@hvitved @tamasvajk : What do you think?

@github-actions github-actions bot added the C# label Nov 29, 2024
@michaelnebel michaelnebel force-pushed the csharp/publicprotected branch from 662f21c to 129b85c Compare November 29, 2024 10:30
@michaelnebel michaelnebel force-pushed the csharp/publicprotected branch from 129b85c to a09262b Compare November 29, 2024 12:49
@hvitved
Copy link
Contributor

hvitved commented Dec 2, 2024

@hvitved @tamasvajk : What do you think?

I think this sounds overall like a (great) win, as DCA also confirms. The changes in results for cs/thread-unsafe-icryptotransform-field-in-class are merely the result of that query not filtering to fields from source code, like it should, so I have opened #18178 for that.

@tamasvajk
Copy link
Contributor

I vaguely remember that we started extracting private symbols because we needed them for stub generation. This is the PR where this was changed: #6000 and it also links to this #5664.

We needed these additional symbols when stubbing the .net framework, but I don't remember any longer for which API. The below might or might not be the case that caused the problem. I think when the below was extracted into a DB without private symbols, then we couldn't generate a stub that compiled.

public abstract class A
{
    public A(int i) => throw null;
}
public class B : A
{
    private B(int i) : base(i) => throw null; // not in the DB
    public B(int i, string s): this(i) => throw null;
}

It might not be a problem any longer for the stub generator, in which case the WithMetadataImportOptions might also give an alternative way of doing the filtering.

@michaelnebel
Copy link
Contributor Author

I vaguely remember that we started extracting private symbols because we needed them for stub generation. This is the PR where this was changed: #6000 and it also links to this #5664.

We needed these additional symbols when stubbing the .net framework, but I don't remember any longer for which API. The below might or might not be the case that caused the problem. I think when the below was extracted into a DB without private symbols, then we couldn't generate a stub that compiled.

public abstract class A
{
    public A(int i) => throw null;
}
public class B : A
{
    private B(int i) : base(i) => throw null; // not in the DB
    public B(int i, string s): this(i) => throw null;
}

It might not be a problem any longer for the stub generator, in which case the WithMetadataImportOptions might also give an alternative way of doing the filtering.

Thank you for the historical perspective! That makes sense.
The new stub generator that @hvitved made is no longer based on the extractor (or a DB produced by the extractor) - so that shouldn't be a problem.

@tamasvajk
Copy link
Contributor

One additional thing to consider: not extracting private fields can have an impact on MaD rows that use .Field[...] with a private field's name. I think these would need to be changed to .SyntheticField[...].

@michaelnebel
Copy link
Contributor Author

michaelnebel commented Dec 3, 2024

The QA run on 5k repos shows
image
Which indeed looks good (most likely the missing result for cs/threat-unsafe-icryptoptransform-field-in-class is due to the error in the query that @hvitved mentions above. I don't see any need to dig further into the results).

The following looks very good:
image

Faster extraction, faster trap import, faster query execution. Roughly 10% across the line and an 8% end-to-end performance improvement!

Link to QA dashboard: https://github.com/github/codeql-qa-ops/issues/240#issuecomment-2513329387

@michaelnebel
Copy link
Contributor Author

One additional thing to consider: not extracting private fields can have an impact on MaD rows that use .Field[...] with a private field's name. I think these would need to be changed to .SyntheticField[...].

That is a good point as well.
I tried to add the following code in the ModelValidation module in ExternalFlow.qll and execute this against a dotnet/runtime database (created from this branch) to check if there are any models pertaining to fields not extracted in dotnet/runtime.

  private predicate myTest(string field) {
    exists(AccessPath p, AccessPathToken token |
      token = p.getToken(_) and
      token.getName() = "Field" and
      field = token.getArgument(0) and
      not exists(Field f | f.getFullyQualifiedNameDebug() = field)
    )
  }

It appears that all the models we only use effectively public fields on dotnet/runtime: This makes sense as the only manual models we have that uses Field are for tuples (and these are public fields) and the model generator only creates models using Field in case a field is effectively public (otherwise it uses SyntheticField).

@michaelnebel michaelnebel marked this pull request as ready for review December 3, 2024 09:59
@michaelnebel michaelnebel requested a review from a team as a code owner December 3, 2024 09:59
Copy link
Contributor

@hvitved hvitved left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, very nice speedup, just a single question.

}
if (symbol is IMethodSymbol method)
{
return method.ExplicitInterfaceImplementations.Any(m => m.ContainingType.DeclaredAccessibility == Accessibility.Public);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be m => m.ContainingType.ShouldExtractSymbol() instead? Otherwise, I don't think we'll extract the implementation of M in

interface I {
    void M();
}

public class C {
    protected class Nested : I {
        void I.M() {}
    }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are absolutely right!

}
if (symbol is IPropertySymbol property)
{
return property.ExplicitInterfaceImplementations.Any(m => m.ContainingType.DeclaredAccessibility == Accessibility.Public);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same

}

if (Symbol.BaseType is not null)
if (Symbol.BaseType is not null && Symbol.BaseType.ShouldExtractSymbol())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change required? The base type cannot be less accessible than the currently processed type: https://learn.microsoft.com/en-us/dotnet/csharp/misc/cs0060. Is there any case that we're extracting a type without its base type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least one of the intermediate solutions that I made required this (even though as you state - this shouldn't be required). I will try to make the suggested modifications on the PR and see if we get any failures.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right - this is not required.

@michaelnebel michaelnebel merged commit 801f696 into github:main Dec 4, 2024
10 checks passed
@michaelnebel michaelnebel deleted the csharp/publicprotected branch December 4, 2024 08:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants